Beta Regression

Team Bee (Beta Regressionists) (Advisor: Dr. Seals)

Anaite Montes Bu, Travis Keep

Introduction

  • Regression analysis is a statistical tool used to explore relationships between variables.

  • Beta regression: used when the dependent variable is a ratio or percentage, i.e. bounded in (0, 1).

Shortfalls of Normal Regression

  • Assumes dependent variable is normally distributed
  • Assumes variance is constant throughout

Neither assumption generally holds for ratios or percentages

When dependent variable lies in (0, 1)

  • True of ratios or percents such as test scores
  • Variance is typically smaller near the extremes, i.e. near 0 or 1.

Reading Skills data set

  • Reading skills modeled on IQ and whether the student has dyslexia (N = 44)
  • Normal regression shows IQ alone not significant in predicting reading score
  • Beta regression shows IQ alone is significant in predicting reading score [1]

Beta distribution

The PDF of a random variable with a beta distribution is as follows.

f(y) = \begin{cases} \frac{y^{\alpha-1}(1-y)^{\beta-1}}{B(\alpha,\beta)}, & 0 \le y \le 1 \\ 0, & \text{elsewhere} \end{cases}

where

B(\alpha,\beta) = \int_0^1 y^{\alpha-1}(1-y)^{\beta-1} \, dy = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}.

\alpha and \beta are the shape parameters, where \alpha > 0 and \beta > 0. [2]
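As a quick sanity check on the density above, the following sketch writes the PDF out by hand and compares it with R's built-in dbeta(). The shape values a = 2 and b = 5 and the helper name beta_pdf are illustrative assumptions, not values from the slides.

```r
# Beta PDF written out from the formula above (helper name is hypothetical).
beta_pdf <- function(y, a, b) {
  inside <- y >= 0 & y <= 1
  dens <- numeric(length(y))                 # density is 0 outside [0, 1]
  # Normalizing constant via the gamma-function identity.
  B <- gamma(a) * gamma(b) / gamma(a + b)
  dens[inside] <- y[inside]^(a - 1) * (1 - y[inside])^(b - 1) / B
  dens
}

a <- 2; b <- 5                               # illustrative shape parameters
y <- c(-0.5, 0.1, 0.5, 0.9, 1.5)

manual  <- beta_pdf(y, a, b)
builtin <- dbeta(y, a, b)
print(all.equal(manual, builtin))

# The integral form of B(a, b) should agree with the gamma form.
B_int <- integrate(function(t) t^(a - 1) * (1 - t)^(b - 1), 0, 1)$value
print(all.equal(B_int, gamma(a) * gamma(b) / gamma(a + b), tolerance = 1e-6))
```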

Beta Distribution Mean and Variance

E[Y] = \mu = \frac{\alpha}{\alpha+\beta} \ \ \ \text{and} \ \ \ V[Y] = \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} [2]
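The closed-form mean and variance can be checked numerically. This is a minimal sketch using base R's integrate() and dbeta(); the shape values a = 2 and b = 5 are chosen only for illustration.

```r
a <- 2; b <- 5                               # illustrative shape parameters

# Closed-form moments from the formulas above.
mean_formula <- a / (a + b)
var_formula  <- a * b / ((a + b)^2 * (a + b + 1))

# The same moments by numerical integration of the density.
mean_num      <- integrate(function(y) y   * dbeta(y, a, b), 0, 1)$value
second_moment <- integrate(function(y) y^2 * dbeta(y, a, b), 0, 1)$value
var_num       <- second_moment - mean_num^2

print(c(mean_formula = mean_formula, mean_num = mean_num))
print(c(var_formula = var_formula, var_num = var_num))
```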

Introduction of \mu and \phi

For beta regression, it is useful to introduce the following

\mu = \frac{\alpha}{\alpha+\beta}, \qquad \phi = \alpha + \beta

\mu is the mean of the beta distribution. The higher \phi is, the smaller the variance, i.e. the less spread out the PDF is. [3]
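To make the reparameterization concrete, here is a small sketch that converts between (\alpha, \beta) and (\mu, \phi) and shows the spread shrinking as \phi grows with the mean held fixed. The helper names to_shape, to_mean_precision and spread, and the values \mu = 0.3 and \phi in {2, 10, 50}, are hypothetical choices for illustration.

```r
# (mu, phi) -> (alpha, beta): invert mu = alpha / (alpha + beta), phi = alpha + beta.
to_shape <- function(mu, phi) {
  stopifnot(mu > 0, mu < 1, phi > 0)
  c(alpha = mu * phi, beta = (1 - mu) * phi)
}

# (alpha, beta) -> (mu, phi), as defined above.
to_mean_precision <- function(alpha, beta) {
  c(mu = alpha / (alpha + beta), phi = alpha + beta)
}

# Standard deviation from the (alpha, beta) variance formula.
spread <- function(mu, phi) {
  s <- to_shape(mu, phi)
  sqrt(s[["alpha"]] * s[["beta"]] / (phi^2 * (phi + 1)))
}

phis <- c(2, 10, 50)                         # same mean, increasing precision
sds  <- sapply(phis, function(phi) spread(0.3, phi))
print(data.frame(phi = phis, sd = round(sds, 3)))
```

The standard deviations decrease down the table, which is the sense in which \phi acts as a precision parameter.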

Revised Beta Distribution

f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu \phi) \Gamma((1 - \mu)\phi)} y^{\mu\phi - 1}(1 - y)^{(1 - \mu)\phi - 1}, \quad 0 < y < 1

Where:

  • \mu is the mean
  • \phi is the precision parameter: for a fixed \mu, a larger \phi gives a smaller variance
  • \Gamma is the gamma function

Beta Distribution Variance

\text{Var}(Y) = \frac{\mu(1 - \mu)}{1 + \phi}

When \mu is near the extremes, 0 or 1, the variance drops. [4]
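For completeness, the variance in terms of \mu and \phi follows from the (\alpha, \beta) form of the variance by substituting \alpha = \mu\phi and \beta = (1 - \mu)\phi, which is what the definitions of \mu and \phi imply. A short worked derivation:

```latex
\begin{aligned}
\alpha &= \mu\phi, \qquad \beta = (1-\mu)\phi, \qquad \alpha + \beta = \phi \\
\text{Var}(Y) &= \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
  = \frac{\mu\phi \cdot (1-\mu)\phi}{\phi^2(\phi+1)}
  = \frac{\mu(1-\mu)}{1+\phi}
\end{aligned}
```

The numerator \mu(1 - \mu) peaks at \mu = 0.5 and goes to 0 as \mu approaches 0 or 1, which is why the variance is smallest near the boundaries.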

Extended Beta Regression

Bias Correction/Reduction - Type of Estimator:

  • ML (Maximum Likelihood): Standard method, but its estimates can be biased, particularly in small samples. [5]
  • BC (Bias-Corrected): Subtracts an estimate of the first-order bias from the ML estimates.
  • BR (Bias-Reduced): Solves adjusted score equations so that the first-order bias is removed during estimation rather than corrected afterwards.
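The three estimator types can be compared side by side through the type argument of betareg(). The sketch below assumes the betareg package is installed and uses the ReadingSkills data that ships with it (columns accuracy, dyslexia, iq) with a constant precision; it is not the suicide-rate model shown in the output on the next slide.

```r
# Sketch: fit the same model with each estimator type and compare coefficients.
# Skipped silently if betareg is not installed.
if (requireNamespace("betareg", quietly = TRUE)) {
  data("ReadingSkills", package = "betareg")
  types <- c("ML", "BC", "BR")
  fits <- lapply(types, function(tp) {
    betareg::betareg(accuracy ~ dyslexia * iq, data = ReadingSkills, type = tp)
  })
  names(fits) <- types
  # One column per estimator; rows are the mean coefficients plus (phi).
  print(round(sapply(fits, coef), 4))
}
```

Any differences between the columns are typically largest for the precision parameter, which is where bias in ML estimates tends to matter most in small samples.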

Bias Correction/Reduction


Call:
betareg(formula = m1, data = suicide_dataset, type = "BC")

Quantile residuals:
    Min      1Q  Median      3Q     Max 
-4.5019 -0.5515 -0.0544  0.4490  6.5013 

Coefficients (mean model with logit link):
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -5.490e+00  1.530e-01 -35.872  < 2e-16 ***
HDI_year                   3.308e+00  2.030e-01  16.297  < 2e-16 ***
GDP_capita                -8.702e-06  7.265e-07 -11.978  < 2e-16 ***
sexmale                    8.151e-01  1.847e-02  44.125  < 2e-16 ***
age25-34 years             8.994e-02  3.226e-02   2.787  0.00531 ** 
age35-54 years             8.603e-02  3.967e-02   2.169  0.03010 *  
age5-14 years             -9.579e-01  4.562e-02 -20.998  < 2e-16 ***
age55-74 years            -1.345e-01  5.281e-02  -2.546  0.01090 *  
age75+ years              -1.821e-01  5.994e-02  -3.038  0.00238 ** 
generationG.I. Generation  4.913e-01  4.812e-02  10.208  < 2e-16 ***
generationGeneration X    -2.116e-01  3.688e-02  -5.738  9.6e-09 ***
generationGeneration Z    -5.561e-01  6.738e-02  -8.253  < 2e-16 ***
generationMillenials      -4.287e-01  4.574e-02  -9.372  < 2e-16 ***
generationSilent           9.104e-02  3.668e-02   2.482  0.01308 *  

Phi coefficients (precision model with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 2.149e+00  2.078e-01  10.344  < 2e-16 ***
HDI_year    6.433e-01  2.881e-01   2.233   0.0255 *  
GDP_capita  8.314e-06  1.150e-06   7.232 4.75e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: BC (bias-corrected)
Log-likelihood: 1.917e+04 on 17 Df
Pseudo R-squared: 0.4625
Number of iterations: 33 (BFGS) + 2 (Fisher scoring) 

Beta Regression Trees

  • This extension uses recursive partitioning to model data that might exhibit subgroup-specific relationships.
  • It builds decision trees by splitting data into different subgroups based on the instability of model parameters across partitioning variables.

Diagnostic Plots

The package betareg allows users to perform both fixed and variable dispersion beta regression. The model is based on the beta distribution, using a parameterization with the mean and precision [6].
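As a sketch of the fixed versus variable dispersion distinction, betareg() takes a two-part formula in which regressors after "|" enter the precision model. The example assumes the betareg package is installed and uses its bundled ReadingSkills data; the choice of dyslexia + iq as precision regressors is an illustrative assumption.

```r
# Skipped silently if betareg is not installed.
if (requireNamespace("betareg", quietly = TRUE)) {
  data("ReadingSkills", package = "betareg")

  # Fixed dispersion: a single precision parameter phi.
  fit_fixed <- betareg::betareg(accuracy ~ dyslexia * iq, data = ReadingSkills)

  # Variable dispersion: terms after "|" model the precision (log link by default).
  fit_var <- betareg::betareg(accuracy ~ dyslexia * iq | dyslexia + iq,
                              data = ReadingSkills)

  print(names(coef(fit_fixed)))   # mean coefficients plus one "(phi)"
  print(names(coef(fit_var)))     # mean coefficients plus "(phi)_..." terms
  print(AIC(fit_fixed, fit_var))  # one way to compare the two fits
}
```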

Reading Skills Test Data Set

  • 44 children
  • 19 dyslexic / 25 normal
  • Test scores range from 0.0 to 1.0

Reading Skills Data Set Regressors

  • IQ (Z-score)
    • Min -1.745
    • Median -0.122
    • Max 1.856
  • Dyslexia
    • Yes
    • No

Reading Skills Dataset Tweaking

  • Dyslexia
    • No -> 0.0
    • Yes -> 1.0
  • Reading Score
    • 1.0 -> 0.99

Remember: the dependent variable must lie in the open interval (0, 1).
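The recoding steps above might look like the following in base R. The data frame toy and its four rows are hypothetical stand-ins, not the real data; only the column names accuracy, dcode and iq are taken from the model formula used in the slides.

```r
# Hypothetical rows standing in for the real data; only the recoding matters.
toy <- data.frame(
  accuracy = c(0.45, 0.80, 1.00, 0.62),
  dyslexia = factor(c("yes", "no", "no", "yes"), levels = c("no", "yes")),
  iq       = c(-0.5, 0.3, 1.2, -1.1)
)

# Dyslexia: no -> 0.0, yes -> 1.0
toy$dcode <- as.numeric(toy$dyslexia == "yes")

# Reading score: 1.0 -> 0.99 so every value lies strictly inside (0, 1).
toy$accuracy[toy$accuracy >= 1] <- 0.99

# Guard against any remaining boundary values before fitting.
stopifnot(all(toy$accuracy > 0 & toy$accuracy < 1))
print(toy)
```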

Beta Regression Fitting

betareg(
  formula = accuracy ~ dcode * iq,
  data = ReadingSkillsModel,
  type = "BC"
)

  • Phi modeled as constant for higher pseudo R^2
  • BC is Bias Correction

Generalized Linear Regression

glm(
  formula = accuracy ~ dcode * iq,
  family = gaussian(link = "logit"),
  data = ReadingSkillsModel
)

  • logit maps (0, 1) to \mathbb{R}
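The logit link and its inverse can be written out directly and checked against base R's qlogis(). The helper names logit and inv_logit and the probe values are illustrative.

```r
# logit: (0, 1) -> real line; inverse logit: real line -> (0, 1).
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))

p <- c(0.01, 0.25, 0.5, 0.75, 0.99)
x <- logit(p)
print(x)

# Agrees with the built-in quantile function of the logistic distribution.
print(all.equal(x, qlogis(p)))
# Round trip recovers the original proportions.
print(all.equal(inv_logit(x), p))
```

Values near 0 or 1 map to large negative or positive numbers, and 0.5 maps to 0, so a linear predictor on the logit scale can never produce a fitted mean outside (0, 1).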

Data Cleaning

A point is extreme and removed if

  • Cook's distance > 4 / N, OR
  • Leverage > 2P / N, where P is the rank of the model, OR
  • Residual more than 3 standard deviations from zero
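The three rules above can be sketched with base R diagnostics. This example uses an ordinary lm() fit on the built-in cars data purely as a stand-in model, and it reads the residual rule as an absolute standardized residual above 3; both are assumptions, not the procedure actually applied to the reading skills models.

```r
fit <- lm(dist ~ speed, data = cars)     # stand-in model, not the slides' model
n <- nrow(cars)                          # N: number of observations
p <- fit$rank                            # P: rank of the model

# Flag a point if any one of the three criteria fires.
extreme <- cooks.distance(fit) > 4 / n |
  hatvalues(fit) > 2 * p / n |
  abs(rstandard(fit)) > 3                # assumed reading of "3 standard deviations"

print(which(extreme))
cleaned <- cars[!extreme, ]
print(nrow(cleaned))
```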

Results for Normal Children

Results for Dyslexic Children

Dyslexia’s effect on scores

A child’s odds of answering a reading skills question correctly decrease by a factor of e^{2.277} (about 9.7) if they are dyslexic, assuming average IQ (a z-score of 0).

IQ’s effect on scores

If a normal child’s IQ increases by 1 standard deviation, their odds of answering a reading skills question correctly increase by a factor of e^{0.3205} (about 1.38).

IQ’s effect on scores cont’d

If a dyslexic child’s IQ increases by 1 standard deviation, their odds of answering a reading skills question correctly decrease by a factor of e^{0.0532} (about 1.05).

0.0532 = 0.3737 - 0.3205
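The odds factors quoted on these slides can be reproduced with a few lines of arithmetic. The signs below are an assumption inferred from the wording ("decreases" versus "increases") and from the model formula accuracy ~ dcode * iq: the dyslexia and interaction coefficients are taken as negative and the IQ coefficient as positive. The variable names are hypothetical.

```r
# Logit-scale coefficients as quoted on the slides (signs are assumed).
b_dyslexia    <- -2.277    # dcode effect at iq = 0
b_iq          <-  0.3205   # IQ slope for non-dyslexic children
b_interaction <- -0.3737   # change in IQ slope for dyslexic children

# Exponentiating a logit coefficient gives a multiplicative change in odds.
odds_factor_dyslexia    <- exp(b_dyslexia)            # below 1: odds decrease
odds_factor_iq_normal   <- exp(b_iq)                  # above 1: odds increase
slope_dyslexic          <- b_iq + b_interaction       # IQ slope for dyslexic children
odds_factor_iq_dyslexic <- exp(slope_dyslexic)        # slightly below 1

print(round(c(
  dyslexia_decrease_factor = 1 / odds_factor_dyslexia,
  iq_normal_factor         = odds_factor_iq_normal,
  dyslexic_slope           = slope_dyslexic,
  iq_dyslexic_factor       = odds_factor_iq_dyslexic
), 4))
```

Reporting 1 / exp(b) for a negative coefficient gives the "decreases by a factor of" wording used on the slides.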

Conclusion

  • Effective for proportion data: ideal for modeling responses bounded in the (0, 1) range.
  • Models both mean and precision, managing boundary cases and latent heterogeneity.
  • Bias correction and beta regression trees expand its capabilities.
  • The betareg package in R offers a powerful, flexible framework for analysts.

References

[1]
M. Smithson and J. Verkuilen, “A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables,” Psychol. Methods, vol. 11, no. 1, pp. 54–71, Mar. 2006, doi: 10.1037/1082-989X.11.1.54.
[2]
D. D. Wackerly, Mathematical statistics with applications, 6th ed. Duxbury Press, 2002.
[3]
S. Ferrari and F. Cribari-Neto, “Beta regression for modelling rates and proportions,” J. Appl. Stat., vol. 31, no. 7, pp. 799–815, Aug. 2004, doi: 10.1080/0266476042000214501.
[4]
S. Ferrari and F. Cribari-Neto, “Beta regression for modelling rates and proportions,” J. Appl. Stat., vol. 31, no. 7, pp. 799–815, Aug. 2004, doi: 10.1080/0266476042000214501.
[5]
B. Grün, I. Kosmidis, and A. Zeileis, “Extended beta regression in R: Shaken, stirred, mixed, and partitioned,” J. Stat. Softw., vol. 48, no. 11, 2012.
[6]
A. Zeileis, F. Cribari-Neto, B. Grün, and I. Kosmidis, “Betareg: Beta regression.” The R Foundation, Apr. 2004.